8th January 2017
You need to select a data set of your own choice (i.e. you may use a dataset already used before in the lab, or from the literature review) for the purposes of building, training and validating the above types of classifiers (1-3). With the aid of an R package, visualise and justify the properties of the selected data set.
The first task was to find out where to look for datasets. Most sources of advice referred to the UCI repository; other sources often reused this data and hosted it in a new format, as Kaggle does. After looking online for suitable datasets, specifically for the problem of classification, two datasets were possible candidates for study: the Taiwanese Credit Card dataset and the Abalone shellfish dataset. I opted for the Abalone dataset on the basis that it has fewer variables and fewer instances, so it should be easier to work with and need less preparation; the credit card data would require more considered cleansing of outliers, potentially pruning of the variables considered, as well as PCA for dimensionality reduction.
The Abalone dataset has a little over 4,000 instances and 8 attributes, compared to 30,000 instances and 24 attributes for the Credit Card dataset. Hopefully the lower dimensionality of the Abalone dataset should lend itself to less data cleansing, because the variables are concrete physical characteristics of the animals.
The typical way to classify this data is to determine approximate age based on the ring value. According to the UCI page for this dataset, the rings roughly suggest age such that y = r + 2.5, where y is the number of years and r is the number of rings. The simplest approach is to attempt to guess the correct number of rings for a test instance based on the other attributes.
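The ring-to-age rule can be expressed directly in R; the helper name `rings_to_age` below is purely illustrative and not part of the dataset.

```r
# Approximate age in years from ring count, per the formula above: y = r + 2.5
rings_to_age <- function(rings) {
  rings + 2.5
}

rings_to_age(15)      # an abalone with 15 rings is roughly 17.5 years old
rings_to_age(c(7, 9)) # 9.5 11.5
```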
Having downloaded the data from the UCI repository, we need to convert the data into a table and check that it’s usable.
abalone_data = data.frame(read.table("../assets/data/abalone.data", sep = ","))
abalone_attr_names = c("sex", "length", "diameter", "height", "weight.whole", "weight.shucked", "weight.viscera", "weight.shell", "rings")
colnames(abalone_data) = abalone_attr_names
#We need to check there are no missing values; if any instances are incomplete we will need to remove them
missing_vals = sum(is.na(abalone_data))
print(c("Number of missing values:", missing_vals), quote = FALSE)
## [1] Number of missing values: 0
Here’s a glance at the dataset.
head(abalone_data)
## sex length diameter height weight.whole weight.shucked weight.viscera
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## weight.shell rings
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
## 6 0.120 8
abalone_summary = summary(abalone_data)
grid.table(abalone_summary)
kable(abalone_summary)

| sex | length | diameter | height | weight.whole | weight.shucked | weight.viscera | weight.shell | rings |
|---|---|---|---|---|---|---|---|---|
| F:1307 | Min. :0.075 | Min. :0.0550 | Min. :0.0000 | Min. :0.0020 | Min. :0.0010 | Min. :0.0005 | Min. :0.0015 | Min. : 1.000 |
| I:1342 | 1st Qu.:0.450 | 1st Qu.:0.3500 | 1st Qu.:0.1150 | 1st Qu.:0.4415 | 1st Qu.:0.1860 | 1st Qu.:0.0935 | 1st Qu.:0.1300 | 1st Qu.: 8.000 |
| M:1528 | Median :0.545 | Median :0.4250 | Median :0.1400 | Median :0.7995 | Median :0.3360 | Median :0.1710 | Median :0.2340 | Median : 9.000 |
| NA | Mean :0.524 | Mean :0.4079 | Mean :0.1395 | Mean :0.8287 | Mean :0.3594 | Mean :0.1806 | Mean :0.2388 | Mean : 9.934 |
| NA | 3rd Qu.:0.615 | 3rd Qu.:0.4800 | 3rd Qu.:0.1650 | 3rd Qu.:1.1530 | 3rd Qu.:0.5020 | 3rd Qu.:0.2530 | 3rd Qu.:0.3290 | 3rd Qu.:11.000 |
| NA | Max. :0.815 | Max. :0.6500 | Max. :1.1300 | Max. :2.8255 | Max. :1.4880 | Max. :0.7600 | Max. :1.0050 | Max. :29.000 |

### The attributes explained
The abalone is a type of shellfish, and each attribute describes a characteristic of the animal. Not being an expert on this animal, I’ve had to research the abalone a little in order to get an idea of what these traits are; knowing a bit more about these properties should hopefully lead to a better understanding of the data. What follows is a list of definitions for each column.
The data was checked for missing values, but it’s also worth checking whether some values are technically impossible and therefore belong to incorrect entries that need to be removed. To do this, all values are iterated over to check for assignments of 0, which would indicate bad data.
abalone_numeric_attr <- abalone_attr_names[which(abalone_attr_names!="sex")]
for (abalone_attr in abalone_numeric_attr)
{
bad_vals = abalone_data[abalone_data[abalone_attr] == 0, ]
#col_data = unlist(abalone_data[abalone_attr])
if (nrow(bad_vals) > 0)
{
print(abalone_attr)
print(abalone_data[abalone_data[abalone_attr] == 0, ])
}
}
## [1] "height"
## sex length diameter height weight.whole weight.shucked weight.viscera
## 1258 I 0.430 0.34 0 0.428 0.2065 0.0860
## 3997 I 0.315 0.23 0 0.134 0.0575 0.0285
## weight.shell rings
## 1258 0.1150 8
## 3997 0.3505 6
rm(bad_vals)

It appears that there are two height values that are incorrect, so we should remove these instances from our dataset.
abalone_data$height[abalone_data$height==0] = NA
abalone_data = na.omit(abalone_data)

It’s also worth examining the weight data, to ensure that the whole weight is not less than the sum of the other weight values.
abalone_data_bad_weight = abalone_data[(abalone_data$weight.whole - (abalone_data$weight.shucked + abalone_data$weight.viscera + abalone_data$weight.shell)) < 0,]
head(abalone_data_bad_weight)
## sex length diameter height weight.whole weight.shucked weight.viscera
## 43 I 0.240 0.175 0.045 0.0700 0.0315 0.0235
## 44 I 0.205 0.150 0.055 0.0420 0.0255 0.0150
## 45 I 0.210 0.150 0.050 0.0420 0.0175 0.0125
## 46 I 0.390 0.295 0.095 0.2030 0.0875 0.0450
## 47 M 0.470 0.370 0.120 0.5795 0.2930 0.2270
## 61 M 0.450 0.345 0.105 0.4115 0.1800 0.1125
## weight.shell rings
## 43 0.020 5
## 44 0.012 5
## 45 0.015 4
## 46 0.075 7
## 47 0.140 9
## 61 0.135 7
print(paste(c("Number of instances where total weight is less than constituent weights:", nrow(abalone_data_bad_weight)), sep = ""), quote = FALSE)
## [1] Number of instances where total weight is less than constituent weights:
## [2] 154
It would appear that the data isn’t completely sanitized, and this sort of erroneous data entry needs to be handled as well. The simplest option is to rule out these instances too.
abalone_data <- abalone_data[!(abalone_data$weight.whole - (abalone_data$weight.shucked + abalone_data$weight.viscera + abalone_data$weight.shell)) < 0,]

Below is an examination of each attribute to see if there are any obvious outliers that might need to be removed.
#boxplot(scale(abalone_data), main="Looking at the data graphically", xlab="Abalone Attributes", ylab="Values")
plot(abalone_data$sex)

for (abalone_attr in abalone_numeric_attr)
{
#print(abalone_attr)
col_data = unlist(abalone_data[abalone_attr])
#print(col_data)
plot(density(col_data), xlab=abalone_attr, main=paste(c("Density of ", abalone_attr), collapse = ''))
}
rm(abalone_attr)
rm(col_data)

In order to work out which attributes should be considered to have valid outliers, I’ve gone with a heuristic approach, choosing to look at the distance between the uppermost outlier for each attribute and its nearest neighbour.
#Create a list to populate with our tail neighbour distances
tail_deltas <- c()
abalone_data_no_sex = abalone_data[, -1]
for (attrib in abalone_data_no_sex) {
#get the last two values
data_tails <- tail(sort(attrib),2)
#push the delta on to our list
tail_deltas <- c(tail_deltas, diff(data_tails))
}
#grab our attribute keys to include in our new table/frame
attributes <- names(abalone_data_no_sex)
#make a new dataframe from the attribute names and deltas
dataframe <- data.frame(attributes = attributes, tail_neighbour_d=tail_deltas)
#get the order for the nearest neighbour starting with the greatest distance and descending
neighbour_order <- order(dataframe$tail_neighbour_d, decreasing=TRUE)
#now apply the order to the frame
sorted_attributes_by_neighbour_d <- dataframe[ neighbour_order, ]
sorted_attributes_by_neighbour_d
## attributes tail_neighbour_d
## 8 rings 2.0000
## 3 height 0.6150
## 5 weight.shucked 0.1395
## 6 weight.viscera 0.1185
## 7 weight.shell 0.1080
## 4 weight.whole 0.0460
## 2 diameter 0.0200
## 1 length 0.0150
While rings is at the top of the leader-board regarding the delta, it’s important to take into account that this data isn’t scaled; it’s very likely that this outlier is the result of a particularly lucky abalone that managed to live longer than the rest of the long tail. Given that the other values are meant to be in grams and millimetres, it’s reasonable to compare values like for like in this instance. As such the weight deltas are comparable and can be excluded from outlier cleansing, with the same applying to diameter and length; height seems anomalous, though, and will need further attention.
It’s easier to see on a box-plot that one value lies far from any neighbours, with its nearest neighbour also having no nearby neighbour; we could probably benefit just from removing the two farthest outliers.
boxplot(abalone_data$height)

abalone_data_cleansed <- abalone_data[ abalone_data$height < .5, ]
boxplot(abalone_data_cleansed$height)

Seeing as the intent is to see whether classification can be used to determine approximate age from physical attributes (aside from rings), it’s worth looking for existing correlations between the attributes and the number of rings.
qplot(x = length,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = diameter,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = height,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = weight.whole,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = weight.shucked,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = weight.viscera,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

qplot(x = weight.shell,
      y = rings,
      data = abalone_data_cleansed,
      alpha = I(0.2), # alpha to help convey density
      geom = "jitter") + # jitter so points don't stack so much
  geom_smooth(method = lm)

With our cleansed data we can see that there is an evident correlation between all of these attributes and the number of rings, but in particular those relating to physical size show the strongest relationship, as their points best match the line of best fit; the weight values are distributed in such a way that there’s a little curve away from the line as you approach 0 on both axes, and they also seem to diverge more from the line as the values increase. Should any attributes be selected as ones to work with, discarding the others, it would be the size attributes: length, diameter and height.
Given that all attributes appear to display a fairly linear relationship to the ring count, we can use r-squared, otherwise known as the coefficient of determination, to verify how well the data matches the line of best fit.
abalone.lm_length <- lm(data = abalone_data_cleansed, formula = rings ~ length)
abalone.lm_diameter <- lm(data = abalone_data_cleansed, formula = rings ~ diameter)
abalone.lm_height <- lm(data = abalone_data_cleansed, formula = rings ~ height)
abalone.lm_weight.whole <- lm(data = abalone_data_cleansed, formula = rings ~ weight.whole)
abalone.lm_weight.shucked <- lm(data = abalone_data_cleansed, formula = rings ~ weight.shucked)
abalone.lm_weight.viscera <- lm(data = abalone_data_cleansed, formula = rings ~ weight.viscera)
abalone.lm_weight.shell <- lm(data = abalone_data_cleansed, formula = rings ~ weight.shell)
abalone.r_squareds <- c(
summary(abalone.lm_length)$adj.r.squared,
summary(abalone.lm_diameter)$adj.r.squared,
summary(abalone.lm_height)$adj.r.squared,
summary(abalone.lm_weight.whole)$adj.r.squared,
summary(abalone.lm_weight.shucked)$adj.r.squared,
summary(abalone.lm_weight.viscera)$adj.r.squared,
summary(abalone.lm_weight.shell)$adj.r.squared
)
abalone_numeric_attr_no_rings <- abalone_numeric_attr[which(abalone_numeric_attr!="rings")]
#make a new dataframe from the attribute names and r-squared values
dataframe.rsquareds <- data.frame(attributes = abalone_numeric_attr_no_rings, r_squared=abalone.r_squareds)
#get the order for the r-squared values, starting with the greatest and descending
rsquareds_order <- order(dataframe.rsquareds$r_squared, decreasing=TRUE)
#now apply the order to the frame
sorted_attributes_by_r_squared <- dataframe.rsquareds[ rsquareds_order, ]
sorted_attributes_by_r_squared
## attributes r_squared
## 7 weight.shell 0.3855553
## 3 height 0.3604153
## 2 diameter 0.3155304
## 1 length 0.2950611
## 4 weight.whole 0.2806291
## 6 weight.viscera 0.2439019
## 5 weight.shucked 0.1681050
#abalone.lm_length

After looking at these results it may be wiser to consider using the shell weight alone if necessary; the r-squared value isn’t a perfect predictor of fitness, but in this case it could make more sense to use a weight value that fluctuates less in nature. The selection of the attributes relating to size appears to be validated.
With the use of a correlation matrix, it’s possible to further validate which attributes could be worth focusing on:
# calculate correlation matrix
correlationMatrix <- cor(abalone_data_cleansed[,2:9])
# summarize the correlation matrix
print(correlationMatrix)
## length diameter height weight.whole weight.shucked
## length 1.0000000 0.9861689 0.8974579 0.9259423 0.9003873
## diameter 0.9861689 1.0000000 0.9040452 0.9260010 0.8952472
## height 0.8974579 0.9040452 1.0000000 0.8873269 0.8361649
## weight.whole 0.9259423 0.9260010 0.8873269 1.0000000 0.9709880
## weight.shucked 0.9003873 0.8952472 0.8361649 0.9709880 1.0000000
## weight.viscera 0.9034261 0.8997239 0.8656499 0.9670606 0.9323660
## weight.shell 0.8977314 0.9055333 0.8897232 0.9561399 0.8830020
## rings 0.5433567 0.5618726 0.6004786 0.5299133 0.4102585
## weight.viscera weight.shell rings
## length 0.9034261 0.8977314 0.5433567
## diameter 0.8997239 0.9055333 0.5618726
## height 0.8656499 0.8897232 0.6004786
## weight.whole 0.9670606 0.9561399 0.5299133
## weight.shucked 0.9323660 0.8830020 0.4102585
## weight.viscera 1.0000000 0.9070347 0.4940547
## weight.shell 0.9070347 1.0000000 0.6210541
## rings 0.4940547 0.6210541 1.0000000
# find attributes that are highly correlated (ideally >0.75)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
abalone.correlation <- data.frame(correlation=correlationMatrix[,8])
#get rid of rings, that's obviously going to be max correlation to itself!
abalone.correlation <- data.frame(attributes=rownames(abalone.correlation)[1:7], 'correlation to rings'=abalone.correlation[1:7, 0:1])
#get the order for correlations
correlation_order <- order(abalone.correlation$correlation, decreasing=TRUE)
#now apply the order to the frame
sorted_correlation_order <- abalone.correlation[ correlation_order, 1:2]
print(sorted_correlation_order)
## attributes correlation.to.rings
## 7 weight.shell 0.6210541
## 3 height 0.6004786
## 2 diameter 0.5618726
## 1 length 0.5433567
## 4 weight.whole 0.5299133
## 6 weight.viscera 0.4940547
## 5 weight.shucked 0.4102585
The correlations between the ring value and the other attributes, when ordered, exactly match the order of the r-squared values; weight.shell should be considered the primary correlate of them all, followed by height.
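This agreement is no coincidence: for a simple linear regression with a single predictor, the (unadjusted) r-squared is exactly the square of the Pearson correlation, so the two rankings must coincide. The report's table uses adjusted r-squared, which sits slightly below cor², hence the small numeric differences. A quick check on synthetic data (the variable names here are illustrative):

```r
# Demonstrate that cor(x, y)^2 equals the R-squared of lm(y ~ x)
set.seed(42)
x <- runif(100)
y <- 2 * x + rnorm(100, sd = 0.5)

r2_from_cor <- cor(x, y)^2
r2_from_lm  <- summary(lm(y ~ x))$r.squared

all.equal(r2_from_cor, r2_from_lm)  # TRUE
```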
Given that the number of rings is a range of integers from 1 to 29, using each individual integer as a class presents a problem. Firstly, as the ring value cannot be considered continuous, it’s not really sensible to treat it as such, so converting to a numeric floating range is not appropriate. In addition, the dataset might not have instances that cover every single possible ring value between 1 and 29, which will cause inaccuracy in some of the models.
An alternative dataset can be created that mitigates this issue by creating an approximate age factor; classification of the abalone can be done more loosely against age ranges, like young, middle and old, with the test instances still compared by the other attributes and mapped to one of these age-group factor values. The split will be determined by the ring/age spread of the sampled population of abalones, aiming for roughly a third of the sample in each age_group. This is because the number of abalones with more than 12 rings starts to drop off quite acutely, and as a percentage those above 15 are a small minority, despite the highest values nearing 30 rings.
summary(abalone_data_cleansed$rings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 8 10 10 11 29
quantile(abalone_data_cleansed$rings, c(1/3, 2/3))
## 33.33333% 66.66667%
## 9 11
abalone_age_groups <- cut(abalone_data_cleansed$rings, breaks=c(-Inf, 9, 12, Inf), labels=c("young","middle","old"))
summary(abalone_age_groups)
## young middle old
## 1978 1357 684
abalone_data_cleansed_age_groups <- subset(abalone_data_cleansed, select=-rings)
abalone_data_cleansed_age_groups$age_group <- abalone_age_groups
rm(abalone_age_groups)

Assuming we have collected one large dataset of already-classified instances, you need to look at three methods of forming training and test sets from this single dataset in R, as described below:

• The holdout method
• Cross-validation
• Leave-one-out cross-validation (Jack Knife)
The holdout method is the most basic separation of the dataset into training and testing data. It creates just one division, albeit selecting instances randomly from across the dataset rather than in a linear fashion.
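The idea can be sketched with base R alone on a dummy frame; this is only an illustration of a simple random two-thirds/one-third split, whereas caret's createDataPartition (used for the actual split) additionally stratifies on the supplied y variable.

```r
# Simple random holdout split: two thirds training, one third testing
set.seed(4321)
df <- data.frame(x = runif(300), y = runif(300))

# Draw two thirds of the row indices for training
train_idx <- sample(seq_len(nrow(df)), size = round(2 * nrow(df) / 3))
train_set <- df[train_idx, ]
test_set  <- df[-train_idx, ]

nrow(train_set)  # 200
nrow(test_set)   # 100
```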
set.seed(4321)
head(abalone_data_cleansed)
## sex length diameter height weight.whole weight.shucked weight.viscera
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## weight.shell rings
## 1 0.150 15
## 2 0.070 7
## 3 0.210 9
## 4 0.155 10
## 5 0.055 7
## 6 0.120 8
holdout.train_indeces <- createDataPartition(y = abalone_data_cleansed$weight.shell, p= 2/3, list = FALSE)
head(holdout.train_indeces, n=20)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 7
## [7,] 8
## [8,] 9
## [9,] 10
## [10,] 13
## [11,] 15
## [12,] 16
## [13,] 20
## [14,] 21
## [15,] 22
## [16,] 23
## [17,] 24
## [18,] 25
## [19,] 26
## [20,] 28
holdout.training <- abalone_data_cleansed[holdout.train_indeces,]
save(holdout.training, file="holdout.training.rda")
holdout.testing <- abalone_data_cleansed[-holdout.train_indeces,]
save(holdout.testing, file="holdout.testing.rda")

By looking at the dimensionality it’s possible to confirm that the new datasets are of the correct size.
dim(holdout.training)
## [1] 2681 9
dim(holdout.testing)
## [1] 1338 9
As an addition, the modified dataset, with an age_group factor attribute in place of the integer rings attribute, will also be partitioned for use with the holdout method.
head(abalone_data_cleansed_age_groups)
## sex length diameter height weight.whole weight.shucked weight.viscera
## 1 M 0.455 0.365 0.095 0.5140 0.2245 0.1010
## 2 M 0.350 0.265 0.090 0.2255 0.0995 0.0485
## 3 F 0.530 0.420 0.135 0.6770 0.2565 0.1415
## 4 M 0.440 0.365 0.125 0.5160 0.2155 0.1140
## 5 I 0.330 0.255 0.080 0.2050 0.0895 0.0395
## 6 I 0.425 0.300 0.095 0.3515 0.1410 0.0775
## weight.shell age_group
## 1 0.150 old
## 2 0.070 young
## 3 0.210 young
## 4 0.155 middle
## 5 0.055 young
## 6 0.120 young
holdout_age_groups.train_indeces <- createDataPartition(y = abalone_data_cleansed_age_groups$weight.shell, p= 2/3, list = FALSE)
head(holdout_age_groups.train_indeces, n=20)
## Resample1
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 6
## [6,] 8
## [7,] 9
## [8,] 12
## [9,] 13
## [10,] 14
## [11,] 15
## [12,] 16
## [13,] 17
## [14,] 19
## [15,] 21
## [16,] 22
## [17,] 23
## [18,] 24
## [19,] 25
## [20,] 26
holdout_age_groups.training <- abalone_data_cleansed_age_groups[holdout_age_groups.train_indeces,]
holdout_age_groups.testing <- abalone_data_cleansed_age_groups[-holdout_age_groups.train_indeces,]

Cross-validation methods are variations on the holdout method. K-fold cross-validation in particular is an extended holdout method whereby the dataset is chunked into smaller fragments called ‘folds’ (where k is the fold count), each of which is used in turn as the test subset while the remaining folds make up the training subset. In this way, training is carried out several times over the same dataset, rotating the role of the folds so that every instance is used several times as training data and once as test data.
This form of training makes better use of a small sample size and helps even out any biases that might occur from just taking one partition for training and another for testing. This averaging out of the training and testing, also happens to benefit larger datasets too, so it is generally considered superior to the basic holdout method.
Repeated k-fold cross-validation takes the technique yet another step further, by splitting the dataset into k folds repeatedly, such that a different fragmentation occurs each time; that is to say, even though the number of divisions is the same, each repetition creates a different set of subsets. This sort of shuffling reuses the dataset for training even more economically.
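The fold mechanics that trainControl drives can be sketched in base R; this is only an illustration of how instances rotate through the test role, not a claim about caret's internal implementation.

```r
# Assign each of n instances to one of k folds of near-equal size
set.seed(123)
k <- 8
n <- 100
fold_id <- sample(rep(1:k, length.out = n))

# Each fold serves once as the test set, the rest as training
for (i in 1:k) {
  test_idx  <- which(fold_id == i)
  train_idx <- which(fold_id != i)
  # a model would be fitted on train_idx and scored on test_idx here
}

table(fold_id)  # every fold holds 12 or 13 instances
```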
For the sake of evaluating this method more thoroughly, k-fold will be applied to the dataset using the Fibonacci sequence numbers 8 and 13 for k and 5 for the number of repeats of cross-validation. For comparison, a standard (unrepeated) 8-fold split will also be used.
cv.train_control_8 <- trainControl(method="cv", number=8)
cv.train_control_8_5 <- trainControl(method="repeatedcv", number=8, repeats = 5)
cv.train_control_13_5 <- trainControl(method="repeatedcv", number=13, repeats = 5)

These training controls will be used later on in this study to train the various models for the task of classification, but for now they are merely abstract instructions on how to chunk the data.
Leave-one-out cross-validation is a form of exhaustive cross-validation. Exhaustive cross-validation methods are said to “learn and test on all possible ways to divide the original sample into a training and a validation set” [CITE HERE!]. The leave-one-out method is a specific form of the leave-p-out technique, where instead of determining the test set as a fraction of the whole, as is the case with k-fold, p is the absolute count of instances to be used in the test subset. What makes this technique exhaustive is that the p-sized subset is iterated such that every possible subset of that size serves as a test subset.
Leave-p-out can be computationally expensive because the larger p is, the more train/test rounds are required: for a given p, every possible subset of that size that can be drawn from the original sample must be created and used as a test subset. This is why leave-one-out might be preferred, since a test set of one means there need only be as many test and training sets as there are instances in the sample.
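The cost argument can be made concrete: leave-p-out requires one train/test round per size-p subset, i.e. choose(n, p) rounds, whereas leave-one-out requires exactly n. Using n = 4019 (the size of the cleansed dataset, matching the 2681 + 1338 partition sizes below):

```r
# Train/test rounds required for leave-p-out on n instances: choose(n, p)
n <- 4019
n              # leave-one-out: 4019 rounds
choose(n, 2)   # leave-two-out: 8074171 rounds
choose(n, 3)   # leave-three-out: already over 10 billion rounds
```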
# define training control
loocv.train_control <- trainControl(method="LOOCV")

As with the k-fold training controls, the leave-one-out control will be used later on but for now remains an abstract set of instructions on how to chunk the data.
You need to construct, train and test a Decision Tree type classifier (C4.5, Random Forest) in R. Train and test your decision tree classifier using the training and test sets generated by the methods tried as part of the 2nd Task.
Searching for the C4.5 method as an existing library within R brought up more than one option; both appear to use an open-source equivalent of C4.5 rather than an official C4.5 implementation. Another option was to use C5.0, which apparently supersedes C4.5, but seeing as the task was to investigate C4.5, J48 has been chosen as a more faithful example of the algorithm.
Below is not only an investigation into C4.5 but also a comparison of two variations. Initially all available dimensions will be used for training.
load("holdout.training.rda")
load("holdout.testing.rda")
dt.c4_5_j48_h <- J48(as.factor(rings)~., holdout.training)
dt_sum.c4_5_j48_h <- summary(dt.c4_5_j48_h)
dt.c4_5_j48_h_party <- as.party(dt.c4_5_j48_h)

The J48 function is extremely quick with this dataset; training takes less than a second, which on its own is not necessarily worthy of note but becomes more interesting when compared to the speed of using the caret train method with a “J48” method argument value.
We can look at the complexity of the tree by looking at its dimensionality, where the length is effectively the tree size, the width is the number of leaves (terminal nodes) and the depth is effectively the number of conditional branch layers.
length(dt.c4_5_j48_h_party)
## [1] 1534
width(dt.c4_5_j48_h_party)
## [1] 784
depth(dt.c4_5_j48_h_party)
## [1] 22
This tree is sufficiently complex to render a graphical representation useless; in fact it is so complex that plotting it is time-intensive as well as being of no value.
dt.c4_5_h2 <- train(as.factor(rings) ~., method="J48", holdout.training, tuneLength = 8)
dt_sum.c4_5_h2 <- summary(dt.c4_5_h2)
dt.c4_5_h2_finalModel <- dt.c4_5_h2$finalModel
dt_sum.c4_5_h2_final <- summary(dt.c4_5_h2_finalModel)

While under 5 minutes for a 4,000 by 9 dataset might seem acceptable compared to other forms of training and data-mining algorithms, in comparison to the J48 function something seems at odds. The proof will be in the comparison of the two models with regard to accuracy on the test dataset.
Again, we can look at the complexity of the tree through dimensionality, length being total size, width being leaf nodes and depth being branch layer count.
#dt.c4_5_h2
#dt.c4_5_h2_finalModel <- dt.c4_5_h2$finalModel
#dt.c4_5_h2_finalModel
dt.c4_5_h2_finalModel_party <- as.party(dt.c4_5_h2_finalModel)
#dt.c4_5_h2_finalModel_party <- as.party(dt.c4_5_h2$finalModel)
length(dt.c4_5_h2_finalModel_party)
## [1] 89
width(dt.c4_5_h2_finalModel_party)
## [1] 45
depth(dt.c4_5_h2_finalModel_party)
## [1] 12
It would appear that the ‘final model’ we are looking at derived from the train function has already been pruned, making for a simpler decision tree; this tree is actually able to be represented graphically within a reasonable amount of time (under 10 seconds).
#plot(dt.c4_5_h2_finalModel)
plot(dt.c4_5_h2_finalModel_party)

Even though the tree can be rendered, it’s not easy to get anything meaningful out of it. Perhaps the most pertinent point is that not only could the classification levels benefit from being simplified, but using fewer dimensions for observation would also force a simpler set of conditional branches.
Below follows the summaries from both methods, for examination:
dt_sum.c4_5_j48_h
##
## === Summary ===
##
## Correctly Classified Instances 2033 75.8299 %
## Incorrectly Classified Instances 648 24.1701 %
## Kappa statistic 0.7279
## Mean absolute error 0.0234
## Root mean squared error 0.1081
## Relative absolute error 35.265 %
## Root relative squared error 59.4069 %
## Total Number of Instances 2681
##
## === Confusion Matrix ===
##
## a b c d e f g h i j k l m n o p q r s t u v w x y z aa <-- classified as
## 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = 2
## 0 5 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = 3
## 0 1 31 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = 4
## 0 1 4 57 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = 5
## 0 0 3 5 133 7 2 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = 6
## 0 0 1 3 12 203 12 5 5 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | f = 7
## 0 0 2 1 8 22 294 25 12 6 0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 | g = 8
## 0 0 0 3 4 6 21 363 13 21 2 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 | h = 9
## 0 0 0 0 2 4 16 31 342 11 5 5 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | i = 10
## 0 0 0 0 3 7 14 13 18 260 5 2 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 | j = 11
## 0 0 0 0 0 1 9 11 15 9 114 3 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 | k = 12
## 0 0 0 0 1 1 4 8 11 12 3 85 1 1 1 0 0 0 1 1 0 0 0 0 0 0 0 | l = 13
## 0 0 0 0 0 4 5 6 11 8 6 4 42 1 2 2 0 0 0 0 0 0 0 0 0 0 0 | m = 14
## 0 0 0 0 0 0 4 4 7 4 5 4 2 29 0 0 0 0 0 1 0 0 0 0 0 0 0 | n = 15
## 0 0 0 0 0 0 1 6 3 3 1 1 3 1 26 0 1 0 0 0 0 0 0 0 0 0 0 | o = 16
## 0 0 0 0 0 0 2 4 1 0 2 1 3 1 0 23 0 0 0 0 0 0 0 0 0 0 0 | p = 17
## 0 0 0 0 0 1 2 1 5 4 1 1 2 2 0 0 10 0 0 0 0 0 0 0 0 0 0 | q = 18
## 0 0 0 0 1 0 1 1 2 3 1 3 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 | r = 19
## 0 0 0 0 0 1 0 0 1 1 2 1 0 0 0 1 0 0 7 0 0 0 0 0 0 0 0 | s = 20
## 0 0 0 0 0 0 0 0 0 3 1 0 1 1 1 0 0 0 0 6 0 0 0 0 0 0 0 | t = 21
## 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | u = 22
## 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 2 0 0 0 0 0 | v = 23
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 | w = 24
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | x = 25
## 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 | y = 26
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 | z = 27
## 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | aa = 29
dt_sum.c4_5_h2
##
## === Summary ===
##
## Correctly Classified Instances 904 33.7188 %
## Incorrectly Classified Instances 1777 66.2812 %
## Kappa statistic 0.2414
## Mean absolute error 0.0578
## Root mean squared error 0.1699
## Relative absolute error 87.138 %
## Root relative squared error 93.3833 %
## Total Number of Instances 2681
##
## === Confusion Matrix ===
##
## a b c d e f g h i j k l m n o p q r s t u v w x y z aa <-- classified as
## 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | a = 2
## 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | b = 3
## 0 0 20 11 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | c = 4
## 0 0 9 40 0 21 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | d = 5
## 0 0 0 34 0 85 20 10 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | e = 6
## 0 0 0 9 0 113 86 26 7 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | f = 7
## 0 0 0 5 0 48 214 53 45 4 4 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | g = 8
## 0 0 0 0 0 14 134 122 137 26 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 | h = 9
## 0 0 0 1 0 8 69 42 225 61 5 3 2 0 2 0 0 0 0 0 0 0 0 0 0 0 0 | i = 10
## 0 0 0 0 0 7 42 26 138 102 5 3 1 0 2 1 0 0 0 0 0 0 0 0 0 0 0 | j = 11
## 0 0 0 0 0 0 30 14 48 45 24 0 0 0 2 1 0 1 0 0 0 0 0 0 0 0 0 | k = 12
## 0 0 0 0 0 0 25 16 31 33 6 12 3 0 2 1 0 0 0 0 0 1 0 0 0 0 0 | l = 13
## 0 0 0 0 0 1 16 5 27 23 2 4 12 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | m = 14
## 0 0 0 0 0 0 12 7 15 15 4 0 3 0 2 1 0 1 0 0 0 0 0 0 0 0 0 | n = 15
## 0 0 0 0 0 0 7 4 2 18 2 0 1 0 12 0 0 0 0 0 0 0 0 0 0 0 0 | o = 16
## 0 0 0 0 0 0 2 0 8 11 4 4 4 0 0 4 0 0 0 0 0 0 0 0 0 0 0 | p = 17
## 0 0 0 0 0 0 1 4 6 12 1 1 0 0 3 0 0 0 0 0 0 1 0 0 0 0 0 | q = 18
## 0 0 0 0 0 0 1 1 2 5 2 3 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 | r = 19
## 0 0 0 0 0 0 0 2 1 6 2 2 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | s = 20
## 0 0 0 0 0 0 1 0 2 7 0 0 0 0 1 0 0 2 0 0 0 0 0 0 0 0 0 | t = 21
## 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 | u = 22
## 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 | v = 23
## 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | w = 24
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 | x = 25
## 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | y = 26
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | z = 27
## 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | aa = 29
Although these values are present in the summaries above, they are repeated below for clarity:
dt_sum.c4_5_j48_h$details[1]
## pctCorrect
## 75.82991
dt_sum.c4_5_h2$details[1]
## pctCorrect
## 33.71876
The J48 function reports a significantly higher accuracy than the train-wrapped J48 call, but at this stage it's hard to be confident this is a good thing. Given the difference in tree complexity, it could well be that the J48 function suffers from massive over-fitting, while the train call has pruned excessively, which would account both for its longer run time and for its diminished accuracy. To understand things further, the trees really need to be tested against the validation subset.
What follows is the output of testing the models against the test subset.
holdout.test_rings <- holdout.testing$rings
dt.c4_5_j48_h_test <- predict(dt.c4_5_j48_h, newdata = holdout.testing)
holdout.test_levels <- min(holdout.test_rings):max(holdout.test_rings)
dt.c4_5_j48_h_test_cm <- confusionMatrix(factor(dt.c4_5_j48_h_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_j48_h_test_cm$table
## Reference
## Prediction 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 3 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 4 5 4 3 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 1 5 5 9 8 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 5 13 27 18 12 6 8 3 2 0 0 1 0 0 0 0 0 0 0 0
## 7 0 0 4 31 45 30 17 8 2 2 2 0 1 0 0 0 0 0 0 0 0
## 8 0 0 1 9 20 39 47 24 17 6 5 4 2 1 3 0 0 0 0 0 0
## 9 0 0 0 3 12 39 46 40 26 21 4 6 5 0 1 0 1 0 0 0 0
## 10 0 0 0 2 8 21 50 53 35 21 13 12 8 5 2 5 4 3 0 0 0
## 11 0 0 0 3 5 15 41 28 37 20 14 3 6 2 6 2 3 1 0 0 0
## 12 0 0 0 0 2 5 9 19 16 7 8 4 2 2 2 1 0 1 1 2 1
## 13 0 0 0 0 2 3 5 9 8 6 8 1 3 7 1 2 4 1 0 0 0
## 14 0 0 0 0 0 0 0 4 1 2 3 2 4 1 0 0 0 3 0 0 1
## 15 0 0 0 0 0 1 2 5 4 2 4 0 3 0 1 1 1 1 0 0 0
## 16 0 0 0 0 0 0 1 0 0 1 4 3 2 1 2 0 1 0 0 0 0
## 17 0 0 0 1 0 1 1 0 0 2 2 0 0 0 1 0 0 0 0 1 1
## 18 0 0 0 0 0 0 0 1 2 0 1 0 2 0 1 0 1 1 0 0 0
## 19 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 0 0 0
## 21 0 0 0 0 0 0 0 0 0 3 1 0 0 1 0 1 0 0 0 1 0
## 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 23 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction 24 25 26 27
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 0
## 12 0 0 0 0
## 13 0 0 0 0
## 14 0 0 0 0
## 15 0 0 0 0
## 16 0 0 0 1
## 17 0 0 0 0
## 18 0 0 0 0
## 19 0 0 0 0
## 20 0 0 0 0
## 21 0 0 0 0
## 22 0 0 0 0
## 23 0 0 0 0
## 24 0 0 0 0
## 25 0 0 0 0
## 26 0 0 0 0
## 27 0 0 0 0
dt.c4_5_j48_h_test_cm$overall[1]
## Accuracy
## 0.2092676
#dt.c4_5_h2
dt.c4_5_h2_test <- predict(dt.c4_5_h2, newdata = holdout.testing)
dt.c4_5_h2_test_cm <- confusionMatrix(factor(dt.c4_5_h2_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_h2_test_cm$table
## Reference
## Prediction 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 4 5 8 6 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 5 0 8 15 10 11 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 7 0 0 8 61 53 21 5 5 1 3 1 0 0 0 0 0 0 0 0 0 0
## 8 0 0 0 10 45 78 91 44 32 7 10 8 8 3 2 1 0 0 1 0 0
## 9 0 0 0 3 12 33 35 30 15 16 5 3 4 3 0 1 1 1 0 0 0
## 10 0 0 0 0 3 25 72 82 63 33 21 10 11 3 4 6 5 2 0 1 0
## 11 0 0 0 1 0 6 16 31 36 26 21 5 10 7 12 3 6 5 0 3 1
## 12 0 0 0 0 0 2 4 6 2 6 3 2 0 1 1 0 0 1 0 0 0
## 13 0 0 0 1 0 0 0 2 0 1 0 1 5 0 0 0 0 1 0 0 0
## 14 0 0 0 0 0 0 0 0 1 1 1 3 2 3 1 1 0 0 0 0 2
## 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 16 0 0 0 0 0 0 1 1 1 0 5 1 0 0 1 1 2 0 0 0 0
## 17 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 2 0 0 0
## 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 23 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 1 0 0 0 0
## 24 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 25 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 26 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction 24 25 26 27
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 7 0 0 0 0
## 8 0 0 0 0
## 9 0 0 0 0
## 10 0 0 0 0
## 11 0 0 0 1
## 12 0 0 0 0
## 13 0 0 0 0
## 14 0 0 0 0
## 15 0 0 0 0
## 16 0 0 0 0
## 17 0 0 0 0
## 18 0 0 0 0
## 19 0 0 0 0
## 20 0 0 0 0
## 21 0 0 0 0
## 22 0 0 0 0
## 23 0 0 0 0
## 24 0 0 0 0
## 25 0 0 0 0
## 26 0 0 0 0
## 27 0 0 0 0
dt.c4_5_h2_test_cm$overall[1]
## Accuracy
## 0.2361734
It would appear that despite the J48-derived model claiming a higher training accuracy, the test confusion matrix data suggests otherwise. Neither 23.6% nor 20.9% is particularly encouraging; however, for the improvement in speed, the loss of about 12% relative accuracy (approximately 3 percentage points from 24%) might be reasonable if the overall accuracy can be improved. To investigate this further, the next steps are to reduce the number of attributes observed, based on the earlier analysis of the data, and to train the models against simpler classification goals.
To attempt to attain higher accuracy, it's worth running the very same decision tree functions against a streamlined classification aim with a more targeted formula. Using raw rings as the classification levels amounted to nearly 30 possible outcomes, so reducing that to a concrete three-tier factor of age groups should not only make training more performant but also make accuracy easier to achieve, since each intended class covers a larger scope.
Previously, each attribute of a continuous numeric type was analysed for correlation with the number of rings. For a simpler decision tree it stands to reason that only the attributes most relevant to this particular classification should be used. To this end, the attributes selected for the revised formula are the top three correlates: weight.shell, height and diameter.
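The age_group factor itself is created elsewhere in the report; as a minimal sketch of the kind of binning involved (the cut points of 8 and 11 rings are illustrative assumptions, not necessarily the thresholds actually used), base R's cut() maps ring counts onto the three tiers:

```r
# Illustrative ring counts; in the report this would be
# abalone_data_cleansed$rings.
rings <- c(4, 7, 9, 12, 20)

# Bin into three age groups; the break points are assumptions.
age_group <- cut(rings,
                 breaks = c(-Inf, 8, 11, Inf),
                 labels = c("young", "middle", "old"))
as.character(age_group)
# "young" "young" "middle" "old" "old"
```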
dt.formula <- as.formula(age_group ~ weight.shell + height + diameter)
dt.c4_5_j48_h_a <- J48(dt.formula, holdout_age_groups.training)
dt_sum.c4_5_j48_h_a <- summary(dt.c4_5_j48_h_a)
dt.c4_5_j48_h_a_party <- as.party(dt.c4_5_j48_h_a)
#
dt_sum.c4_5_j48_h_a
##
## === Summary ===
##
## Correctly Classified Instances 1892 70.5707 %
## Incorrectly Classified Instances 789 29.4293 %
## Kappa statistic 0.5053
## Mean absolute error 0.2797
## Root mean squared error 0.374
## Relative absolute error 68.3388 %
## Root relative squared error 82.6712 %
## Total Number of Instances 2681
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 1084 223 14 | a = young
## 243 630 37 | b = middle
## 79 193 178 | c = old
#
length(dt.c4_5_j48_h_a_party)
## [1] 103
width(dt.c4_5_j48_h_a_party)
## [1] 52
depth(dt.c4_5_j48_h_a_party)
## [1] 11
#
plot(dt.c4_5_j48_h_a_party)
dt.c4_5_h2_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8)
dt_sum.c4_5_h2_a <- summary(dt.c4_5_h2_a)
dt_sum.c4_5_h2_a_final <- summary(dt.c4_5_h2_a$finalModel)
#
dt_sum.c4_5_h2_a
##
## === Summary ===
##
## Correctly Classified Instances 1821 67.9224 %
## Incorrectly Classified Instances 860 32.0776 %
## Kappa statistic 0.4683
## Mean absolute error 0.2995
## Root mean squared error 0.387
## Relative absolute error 73.1807 %
## Root relative squared error 85.5498 %
## Total Number of Instances 2681
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 1048 229 44 | a = young
## 234 590 86 | b = middle
## 69 198 183 | c = old
#
dt.c4_5_h2_a_finalModel <- dt.c4_5_h2_a$finalModel
dt.c4_5_h2_a_finalModel_party <- as.party(dt.c4_5_h2_a_finalModel)
#
length(dt.c4_5_h2_a_finalModel_party)
## [1] 33
width(dt.c4_5_h2_a_finalModel_party)
## [1] 17
depth(dt.c4_5_h2_a_finalModel_party)
## [1] 8
#
plot(dt.c4_5_h2_a_finalModel_party)
Accuracy comparison of revised C4.5/J48 models
dt_sum.c4_5_j48_h_a$details[1]
## pctCorrect
## 70.57068
dt_sum.c4_5_h2_a$details[1]
## pctCorrect
## 67.92242
Interestingly, the J48 function's training accuracy is actually slightly lower than that of the previous version, while the train J48 call has improved dramatically, to the point of being nearly on par with the J48 function result. Validating the models against the test subset should prove even more revealing.
What follows is the output of testing the models against the test subset.
#holdout.test_rings <- holdout.testing$rings
dt.c4_5_j48_h_a_test <- predict(dt.c4_5_j48_h_a, newdata = holdout_age_groups.testing)
#holdout.test_levels <- min(holdout.test_rings):max(holdout.test_rings)
#dt.c4_5_j48_h_a_test_cm <- confusionMatrix(factor(dt.c4_5_j48_h_test, levels=holdout.test_levels), factor(holdout.test_rings, levels = holdout.test_levels))
dt.c4_5_j48_h_a_test_cm <- confusionMatrix(dt.c4_5_j48_h_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_h_a_test_cm$table
## Reference
## Prediction young middle old
## young 517 147 34
## middle 119 267 138
## old 21 33 62
dt.c4_5_j48_h_a_test_cm$overall[1]
## Accuracy
## 0.632287
#dt.c4_5_h2
dt.c4_5_h2_a_test <- predict(dt.c4_5_h2_a, newdata = holdout_age_groups.testing)
dt.c4_5_h2_a_test_cm <- confusionMatrix(dt.c4_5_h2_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_h2_a_test_cm$table
## Reference
## Prediction young middle old
## young 510 134 27
## middle 120 263 123
## old 27 50 84
dt.c4_5_h2_a_test_cm$overall[1]
## Accuracy
## 0.6405082
The accuracy has improved markedly, so it's safe to say that the combination of streamlining the formula and simplifying the classification requirement has improved things. This is the new baseline; now it's worth looking at any further improvement that can be made through the k-folds and leave-one-out cross-validation techniques.
Given how the accuracy gap between the two forms of C4.5 model generation narrowed to an absolute delta of less than one percentage point, coupled with the speed at which the J48 function returns, the next phase of experimentation will use only this function with the relevant training control options.
The J48 function does not accept the caret training control objects as valid control parameters; calling the Weka function WOW with "J48" as the sole argument presents the list of arguments that can be passed in to a Weka_control function call to configure training.
WOW("J48")
## -U Use unpruned tree.
## -O Do not collapse tree.
## -C <pruning confidence>
## Set confidence threshold for pruning. (default 0.25)
## Number of arguments: 1.
## -M <minimum number of instances>
## Set minimum number of instances per leaf. (default 2)
## Number of arguments: 1.
## -R Use reduced error pruning.
## -N <number of folds>
## Set number of folds for reduced error pruning. One fold is
## used as pruning set. (default 3)
## Number of arguments: 1.
## -B Use binary splits only.
## -S Do not perform subtree raising.
## -L Do not clean up after the tree has been built.
## -A Laplace smoothing for predicted probabilities.
## -J Do not use MDL correction for info gain on numeric
## attributes.
## -Q <seed>
## Seed for random data shuffling (default 1).
## Number of arguments: 1.
## -doNotMakeSplitPointActualValue
## Do not make split point actual value.
## -output-debug-info
## If set, classifier is run in debug mode and may output
## additional info to the console
## -do-not-check-capabilities
## If set, classifier capabilities are not checked before
## classifier is built (use with caution).
## -num-decimal-places
## The number of decimal places for the output of numbers in
## the model (default 2).
## Number of arguments: 1.
## -batch-size
## The desired batch size for batch prediction (default 100).
## Number of arguments: 1.
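These options are passed via a Weka_control object rather than a caret trainControl; as a sketch (the C and M values below are illustrative, not tuned), the pruning confidence and minimum leaf size could be set like so:

```r
library(RWeka)

# Tighter pruning confidence and larger minimum leaf size than the
# defaults (0.25 and 2); both values here are illustrative assumptions.
dt.c4_5_j48_tuned <- J48(dt.formula, data = holdout_age_groups.training,
                         control = Weka_control(C = 0.15, M = 10))
```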
Unfortunately for the performant J48 function, none of these options allows for custom methods of training such as cross-validation. At this point, J48 has to be disregarded due to inflexibility, despite such good run-time training speeds.
Three types of k-folds cross-validation configurations were created; at this point it’s opportune to examine which, if any, are able to improve the accuracy of training and validation. Each of the three types of k-folds configurations will be used to train the decision tree.
The important implementation detail to emphasise here is that while the holdout method required a one-off explicit call to partition the data, with the user assigning which partition serves as the training or testing subset, with cross-validation the entire dataset is passed into the training call; this is because training occurs several times over many subsampled testing subsets drawn from the original sample. For the sake of completeness, the holdout testing subset can still be used to further interrogate the model, in order to better compare against other classifiers.
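The cross-validation control objects referenced below were defined earlier in the report; for reference, definitions consistent with the configurations described (plain 8-fold, 8-fold repeated 5 times, 13-fold repeated 5 times) would look like this in caret — a sketch, since the original definitions are not shown in this section:

```r
library(caret)

# Plain k-folds with 8 folds.
cv.train_control_8 <- trainControl(method = "cv", number = 8)

# 8 folds, repeated 5 times.
cv.train_control_8_5 <- trainControl(method = "repeatedcv",
                                     number = 8, repeats = 5)

# 13 folds, repeated 5 times.
cv.train_control_13_5 <- trainControl(method = "repeatedcv",
                                      number = 13, repeats = 5)
```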
K-folds, with 8 folds
#dt.c4_5_j48_kf_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_8)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_kf_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_8)
K-folds, with 8 folds, 5 repetitions
#dt.c4_5_j48_r5kf8_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_8_5)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_r5kf8_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_8_5)
K-folds, with 13 folds, 5 repetitions
#dt.c4_5_j48_r5kf13_a <- train(dt.formula, method="J48", holdout_age_groups.training, tuneLength = 8, trControl = cv.train_control_13_5)
#abalone_data_cleansed_age_groups
dt.c4_5_j48_r5kf13_a <- train(dt.formula, method="J48", abalone_data_cleansed_age_groups, tuneLength = 8, trControl = cv.train_control_13_5)
Below is a comparison of the different training results for k-folds; ultimately, the best one will be picked as the preferred k-folds configuration going forward. Obviously, should this preferred configuration turn out to be too time-intensive with other types of classifier, falling back to another configuration will be considered.
K-folds, with 8 folds
dt_sum.c4_5_j48_kf_a <- summary(dt.c4_5_j48_kf_a)
dt.c4_5_j48_kf_a_final <- summary(dt.c4_5_j48_kf_a$finalModel)
#
dt_sum.c4_5_j48_kf_a
##
## === Summary ===
##
## Correctly Classified Instances 2771 68.9475 %
## Incorrectly Classified Instances 1248 31.0525 %
## Kappa statistic 0.4824
## Mean absolute error 0.2887
## Root mean squared error 0.3799
## Relative absolute error 70.4376 %
## Root relative squared error 83.9297 %
## Total Number of Instances 4019
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 1647 268 63 | a = young
## 402 831 124 | b = middle
## 95 296 293 | c = old
#
dt.c4_5_j48_kf_a_finalModel <- dt.c4_5_j48_kf_a$finalModel
dt.c4_5_j48_kf_a_finalModel_party <- as.party(dt.c4_5_j48_kf_a_finalModel)
#
length(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 33
width(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 17
depth(dt.c4_5_j48_kf_a_finalModel_party)
## [1] 8
#
plot(dt.c4_5_j48_kf_a_finalModel_party)
K-folds, with 8 folds, 5 repetitions
dt_sum.c4_5_r5kf8_a <- summary(dt.c4_5_j48_r5kf8_a)
dt.c4_5_j48_r5kf8_a_final <- summary(dt.c4_5_j48_r5kf8_a$finalModel)
#
dt_sum.c4_5_r5kf8_a
##
## === Summary ===
##
## Correctly Classified Instances 2694 67.0316 %
## Incorrectly Classified Instances 1325 32.9684 %
## Kappa statistic 0.4385
## Mean absolute error 0.3154
## Root mean squared error 0.3971
## Relative absolute error 76.9366 %
## Root relative squared error 87.7163 %
## Total Number of Instances 4019
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 1696 239 43 | a = young
## 516 746 95 | b = middle
## 168 264 252 | c = old
#
dt.c4_5_j48_r5kf8_a_finalModel <- dt.c4_5_j48_r5kf8_a$finalModel
dt.c4_5_j48_r5kf8_a_finalModel_party <- as.party(dt.c4_5_j48_r5kf8_a_finalModel)
#
length(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 39
width(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 20
depth(dt.c4_5_j48_r5kf8_a_finalModel_party)
## [1] 9
#
plot(dt.c4_5_j48_r5kf8_a_finalModel_party)
K-folds, with 13 folds, 5 repetitions
dt_sum.c4_5_r5kf13_a <- summary(dt.c4_5_j48_r5kf13_a)
dt.c4_5_j48_r5kf13_a_final <- summary(dt.c4_5_j48_r5kf13_a$finalModel)
#
dt_sum.c4_5_r5kf13_a
##
## === Summary ===
##
## Correctly Classified Instances 2731 67.9522 %
## Incorrectly Classified Instances 1288 32.0478 %
## Kappa statistic 0.4628
## Mean absolute error 0.2957
## Root mean squared error 0.3845
## Relative absolute error 72.1388 %
## Root relative squared error 84.9372 %
## Total Number of Instances 4019
##
## === Confusion Matrix ===
##
## a b c <-- classified as
## 1655 264 59 | a = young
## 436 801 120 | b = middle
## 113 296 275 | c = old
#
dt.c4_5_j48_r5kf13_a_finalModel <- dt.c4_5_j48_r5kf13_a$finalModel
dt.c4_5_j48_r5kf13_a_finalModel_party <- as.party(dt.c4_5_j48_r5kf13_a_finalModel)
#
length(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 59
width(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 30
depth(dt.c4_5_j48_r5kf13_a_finalModel_party)
## [1] 9
#
plot(dt.c4_5_j48_r5kf13_a_finalModel_party)
dt_sum.c4_5_j48_kf_a$details[1]
## pctCorrect
## 68.9475
dt_sum.c4_5_r5kf8_a$details[1]
## pctCorrect
## 67.0316
dt_sum.c4_5_r5kf13_a$details[1]
## pctCorrect
## 67.95223
Having run the training, the preliminary results suggest that while repeating the whole k-folds process a number of times can add some accuracy, increasing the k count doesn't add much more. Depending on the size of the real-world dataset and the economic benefit of even the slightest improvement in accuracy, a trade-off between accuracy and performance can be made, and sometimes a quicker, less accurate option is the right tool for the job. That said, prediction is significantly faster than training for decision trees; once the model has been built, new data is simply processed into a class.
Given that, for k-folds with 5 repetitions, the k = 13 version took about twice as long to compute as the k = 8 version, the k = 8 version will be considered the preferred choice, assuming the validation doesn't suggest otherwise.
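The timing comparison above was informal; base R's system.time() makes it repeatable. A sketch wrapping one of the train() calls already used (any of the three configurations can be substituted):

```r
# Wall-clock comparison of training cost; "elapsed" is the figure of
# interest when comparing the k = 8 and k = 13 configurations.
timing_k8 <- system.time(
  train(dt.formula, method = "J48", abalone_data_cleansed_age_groups,
        tuneLength = 8, trControl = cv.train_control_8_5)
)
timing_k8["elapsed"]
```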
K-folds, with 8 folds
dt.c4_5_j48_kf_a_test <- predict(dt.c4_5_j48_kf_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_kf_a_test_cm <- confusionMatrix(dt.c4_5_j48_kf_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_kf_a_test_cm$table
## Reference
## Prediction young middle old
## young 543 146 28
## middle 90 257 116
## old 24 44 90
dt.c4_5_j48_kf_a_test_cm$overall[1]
## Accuracy
## 0.6651719
K-folds, with 8 folds, 5 repetitions
dt.c4_5_j48_r5kf8_a_test <- predict(dt.c4_5_j48_r5kf8_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_r5kf8_a_test_cm <- confusionMatrix(dt.c4_5_j48_r5kf8_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_r5kf8_a_test_cm$table
## Reference
## Prediction young middle old
## young 565 186 47
## middle 78 227 108
## old 14 34 79
dt.c4_5_j48_r5kf8_a_test_cm$overall[1]
## Accuracy
## 0.6509716
K-folds, with 13 folds, 5 repetitions
dt.c4_5_j48_r5kf13_a_test <- predict(dt.c4_5_j48_r5kf13_a, newdata = holdout_age_groups.testing)
dt.c4_5_j48_r5kf13_a_test_cm <- confusionMatrix(dt.c4_5_j48_r5kf13_a_test, holdout_age_groups.testing$age_group)
dt.c4_5_j48_r5kf13_a_test_cm$table
## Reference
## Prediction young middle old
## young 547 159 32
## middle 88 244 114
## old 22 44 88
dt.c4_5_j48_r5kf13_a_test_cm$overall[1]
## Accuracy
## 0.6569507
The last C4.5 decision tree to explore is a model using the leave-one-out training method; however, research and experimentation suggest this is too computationally expensive for decision trees. When attempting to train a tree with this training control, over an hour passed without any output; it would appear that for trees, not only p (the number of predictors) but n (the total number of observations) should be very low, certainly smaller than the Abalone dataset.
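The cost is easy to see in terms of model fits: repeated k-folds with 8 folds and 5 repetitions fits 40 models, whereas leave-one-out fits one model per observation, which for the roughly 4,000-row Abalone dataset is around a hundred times more work. The control itself is a one-liner in caret (a sketch of how loocv.train_control is presumed to be defined):

```r
library(caret)

# One model fit per held-out observation.
loocv.train_control <- trainControl(method = "LOOCV")

# Rough fit counts: 8 folds x 5 repeats = 40 fits, versus ~4,019 for LOOCV.
```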
The next classifier type to explore is the random forest: an ensemble method that builds many decision trees, each on a bootstrap sample of the data and each considering only a random subset of the predictors at every split, then classifies by majority vote across the trees. Averaging over many de-correlated trees is specifically intended to counter the over-fitting that a single deep tree is prone to.
dt.rf_h_a <- train(dt.formula, data = holdout_age_groups.training, method = "rf", prox = TRUE)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
dt.rf_h_a_finalModel <- dt.rf_h_a$finalModel
dt_sum.rf_h_a <- summary(dt.rf_h_a)
#
dt.rf_h_a_finalModel$forest[11]
## $nrnodes
## [1] 1559
dt.rf_h_a_finalModel$forest[12]
## $ntree
## [1] 500
#
print(dt.rf_h_a)
## Random Forest
##
## 2681 samples
## 3 predictors
## 3 classes: 'young', 'middle', 'old'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2681, 2681, 2681, 2681, 2681, 2681, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6078591 0.3560456
## 3 0.6026535 0.3492116
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(dt.rf_h_a_finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 39.16%
## Confusion matrix:
## young middle old class.error
## young 994 261 66 0.2475397
## middle 289 467 154 0.4868132
## old 74 206 170 0.6222222
plot(dt.rf_h_a_finalModel)
dt.rf_h_a_test <- predict(dt.rf_h_a, newdata = holdout_age_groups.testing)
dt.rf_h_a_test_cm <- confusionMatrix(dt.rf_h_a_test, holdout_age_groups.testing$age_group)
dt.rf_h_a_test_cm$table
## Reference
## Prediction young middle old
## young 503 151 37
## middle 118 218 115
## old 36 78 82
dt.rf_h_a_test_cm$overall[1]
## Accuracy
## 0.6001495
Having established a baseline random forest using the holdout training subset and bootstrap resampling, the same model type is now trained with the repeated k-folds cross-validation control.
dt.rf_cv_r5kf8_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, method = "rf", prox = TRUE, trControl = cv.train_control_8_5)
## note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
dt.rf_cv_r5kf8_a_finalModel <- dt.rf_cv_r5kf8_a$finalModel
dt_sum.rf_cv_r5kf8_a <- summary(dt.rf_cv_r5kf8_a)
#
dt.rf_cv_r5kf8_a_finalModel$forest[11]
## $nrnodes
## [1] 2343
dt.rf_cv_r5kf8_a_finalModel$forest[12]
## $ntree
## [1] 500
#
print(dt.rf_cv_r5kf8_a)
## Random Forest
##
## 4019 samples
## 3 predictors
## 3 classes: 'young', 'middle', 'old'
##
## No pre-processing
## Resampling: Cross-Validated (8 fold, repeated 5 times)
## Summary of sample sizes: 3517, 3515, 3518, 3515, 3517, 3516, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6021386 0.3466302
## 3 0.5988570 0.3424393
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(dt.rf_cv_r5kf8_a_finalModel)
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 39.86%
## Confusion matrix:
## young middle old class.error
## young 1483 406 89 0.2502528
## middle 437 679 241 0.4996315
## old 113 316 255 0.6271930
plot(dt.rf_cv_r5kf8_a_finalModel)
Based on the previous performances for k-folds cross-validation, only one configuration has been used for the sake of the random forest, being the one considered the happy medium of the original three.
dt.rf_cv_r5kf8_a_test <- predict(dt.rf_cv_r5kf8_a, newdata = holdout_age_groups.testing)
dt.rf_cv_r5kf8_a_test_cm <- confusionMatrix(dt.rf_cv_r5kf8_a_test, holdout_age_groups.testing$age_group)
dt.rf_cv_r5kf8_a_test_cm$table
## Reference
## Prediction young middle old
## young 650 4 3
## middle 5 440 4
## old 2 3 227
dt.rf_cv_r5kf8_a_test_cm$overall[1]
## Accuracy
## 0.9843049
The 98.4% accuracy here should be treated with caution: this cross-validated model was trained on the full dataset, which includes the instances of the holdout testing subset, so the figure partly reflects the model predicting data it has already seen rather than genuine generalisation.
As with the C4.5 decision tree, leave-one-out cross-validation is not suitable for this type of model.
Random forests are good at avoiding over-fitting, which can translate into improved real-world accuracy.
You need to construct, train and test Naïve Bayes type classifier in R. Train and test your Naïve Bayes classifier using the training and test sets generated based on the methods tried as part of the 2nd Task.
Naive Bayes is a relatively quick and simple probabilistic classifier that is often used as a benchmark for other forms of classification; if another technique is in some way superior, be that in terms of speed or accuracy, then it's a result worth reporting. If a proposed algorithm cannot improve on Naive Bayes, then it needs further work, should be consigned to a very niche problem, or set aside completely.
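The "naive" part is the assumption that, within a class, the predictors are conditionally independent, which lets the class posterior factorise into per-attribute terms:

```latex
P(C = c \mid x_1, \dots, x_p) \;\propto\; P(C = c) \prod_{j=1}^{p} P(x_j \mid C = c)
```

Classification picks the class c that maximises this product; for continuous predictors like the abalone measurements, klaR's NaiveBayes estimates each P(x_j | C = c) with a per-class Gaussian density by default.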
Precedent was set in the previous task to use a specific formula to target key predictors (weight.shell, height, diameter) and classify data instances based on a factor representing simplified age grouping; that precedent applies to the modelling in this chapter and onwards.
nb.h_a <- NaiveBayes(dt.formula, data = holdout_age_groups.training)
plot(nb.h_a)
nb.h_a_test <- predict(nb.h_a, holdout_age_groups.testing)
# summarize results
nb.h_a_test_cm <- confusionMatrix(nb.h_a_test$class, holdout_age_groups.testing$age_group)
nb.h_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 468 121 45
## middle 186 293 149
## old 3 33 40
##
## Overall Statistics
##
## Accuracy : 0.5987
## 95% CI : (0.5718, 0.6251)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : 1.811e-15
##
## Kappa : 0.3318
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.7123 0.6555 0.1709
## Specificity 0.7562 0.6240 0.9674
## Pos Pred Value 0.7382 0.4666 0.5263
## Neg Pred Value 0.7315 0.7831 0.8463
## Prevalence 0.4910 0.3341 0.1749
## Detection Rate 0.3498 0.2190 0.0299
## Detection Prevalence 0.4738 0.4694 0.0568
## Balanced Accuracy 0.7343 0.6397 0.5692
nb.r5kf8_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, trControl = cv.train_control_8_5, method = "nb")
plot(nb.r5kf8_a)
nb.r5kf8_a_test <- predict(nb.r5kf8_a, holdout_age_groups.testing)
# summarize results
nb.r5kf8_a_test_cm <- confusionMatrix(nb.r5kf8_a_test, holdout_age_groups.testing$age_group)
nb.r5kf8_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 481 128 53
## middle 174 288 145
## old 2 31 36
##
## Overall Statistics
##
## Accuracy : 0.6016
## 95% CI : (0.5748, 0.628)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : 3.029e-16
##
## Kappa : 0.3321
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.7321 0.6443 0.15385
## Specificity 0.7342 0.6420 0.97011
## Pos Pred Value 0.7266 0.4745 0.52174
## Neg Pred Value 0.7396 0.7825 0.84397
## Prevalence 0.4910 0.3341 0.17489
## Detection Rate 0.3595 0.2152 0.02691
## Detection Prevalence 0.4948 0.4537 0.05157
## Balanced Accuracy 0.7332 0.6431 0.56198
nb.loo_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, trControl = loocv.train_control, method = "nb")
plot(nb.loo_a)
nb.loo_a_test <- predict(nb.loo_a, holdout_age_groups.testing)
# summarize results
nb.loo_a_test_cm <- confusionMatrix(nb.loo_a_test, holdout_age_groups.testing$age_group)
nb.loo_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 481 128 53
## middle 174 288 145
## old 2 31 36
##
## Overall Statistics
##
## Accuracy : 0.6016
## 95% CI : (0.5748, 0.628)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : 3.029e-16
##
## Kappa : 0.3321
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.7321 0.6443 0.15385
## Specificity 0.7342 0.6420 0.97011
## Pos Pred Value 0.7266 0.4745 0.52174
## Neg Pred Value 0.7396 0.7825 0.84397
## Prevalence 0.4910 0.3341 0.17489
## Detection Rate 0.3595 0.2152 0.02691
## Detection Prevalence 0.4948 0.4537 0.05157
## Balanced Accuracy 0.7332 0.6431 0.56198
nb.h_a_test_cm$overall[1]
## Accuracy
## 0.5986547
nb.r5kf8_a_test_cm$overall[1]
## Accuracy
## 0.6016442
nb.loo_a_test_cm$overall[1]
## Accuracy
## 0.6016442
You need to construct, train and test K-NN type classifier in R. Train and test your K-NN classifier using the training and test sets generated based on the methods tried as part of the 2nd Task.
The k-nearest neighbours (k-NN) classifier is an instance-based method: rather than building a model up front, it stores the training instances and classifies a new point by finding the k training points closest to it (typically by Euclidean distance over the numeric predictors) and taking a majority vote of their classes. The choice of k is the main tuning parameter, which caret's train call selects automatically.
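As a toy illustration of the k-NN rule itself (the points and classes are invented for the example), with k = 3 a test point is assigned the majority class among its three nearest training points:

```r
# Four invented training points in two dimensions with their classes.
train_x <- matrix(c(0.10, 0.20, 0.90, 1.00,   # first predictor
                    0.15, 0.25, 0.95, 1.05),  # second predictor
                  ncol = 2)
train_y <- c("young", "young", "old", "old")
test_x  <- c(0.12, 0.20)

# Euclidean distance from the test point to every training point.
d <- sqrt(rowSums((train_x - matrix(test_x, nrow = 4, ncol = 2, byrow = TRUE))^2))

# Majority vote among the k = 3 nearest neighbours.
nearest <- order(d)[1:3]
names(which.max(table(train_y[nearest])))
# "young"
```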
knn.h_a <- train(dt.formula, data = holdout_age_groups.training, method = "knn")
plot(knn.h_a)
knn.h_a_test <- predict(knn.h_a, holdout_age_groups.testing)
# summarize results
knn.h_a_test_cm <- confusionMatrix(knn.h_a_test, holdout_age_groups.testing$age_group)
knn.h_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 526 159 33
## middle 110 233 130
## old 21 55 71
##
## Overall Statistics
##
## Accuracy : 0.6203
## 95% CI : (0.5937, 0.6464)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3664
## Mcnemar's Test P-Value : 4.017e-09
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.8006 0.5213 0.30342
## Specificity 0.7181 0.7306 0.93116
## Pos Pred Value 0.7326 0.4926 0.48299
## Neg Pred Value 0.7887 0.7526 0.86314
## Prevalence 0.4910 0.3341 0.17489
## Detection Rate 0.3931 0.1741 0.05306
## Detection Prevalence 0.5366 0.3535 0.10987
## Balanced Accuracy 0.7593 0.6259 0.61729
knn.r5kf8_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, trControl = cv.train_control_8_5, method = "knn")
plot(knn.r5kf8_a)
knn.r5kf8_a_test <- predict(knn.r5kf8_a, holdout_age_groups.testing)
# summarize results
knn.r5kf8_a_test_cm <- confusionMatrix(knn.r5kf8_a_test, holdout_age_groups.testing$age_group)
knn.r5kf8_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 550 130 37
## middle 91 276 98
## old 16 41 99
##
## Overall Statistics
##
## Accuracy : 0.6913
## 95% CI : (0.6658, 0.716)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4859
## Mcnemar's Test P-Value : 2.133e-08
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.8371 0.6174 0.42308
## Specificity 0.7548 0.7879 0.94837
## Pos Pred Value 0.7671 0.5935 0.63462
## Neg Pred Value 0.8277 0.8041 0.88579
## Prevalence 0.4910 0.3341 0.17489
## Detection Rate 0.4111 0.2063 0.07399
## Detection Prevalence 0.5359 0.3475 0.11659
## Balanced Accuracy 0.7960 0.7027 0.68572
knn.loo_a <- train(dt.formula, data = abalone_data_cleansed_age_groups, trControl = loocv.train_control, method = "knn")
plot(knn.loo_a)
knn.loo_a_test <- predict(knn.loo_a, holdout_age_groups.testing)
# summarize results
knn.loo_a_test_cm <- confusionMatrix(knn.loo_a_test, holdout_age_groups.testing$age_group)
knn.loo_a_test_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction young middle old
## young 548 132 36
## middle 92 277 99
## old 17 38 99
##
## Overall Statistics
##
## Accuracy : 0.6906
## 95% CI : (0.665, 0.7153)
## No Information Rate : 0.491
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4845
## Mcnemar's Test P-Value : 6.183e-09
##
## Statistics by Class:
##
## Class: young Class: middle Class: old
## Sensitivity 0.8341 0.6197 0.42308
## Specificity 0.7533 0.7856 0.95018
## Pos Pred Value 0.7654 0.5919 0.64286
## Neg Pred Value 0.8248 0.8046 0.88598
## Prevalence 0.4910 0.3341 0.17489
## Detection Rate 0.4096 0.2070 0.07399
## Detection Prevalence 0.5351 0.3498 0.11510
## Balanced Accuracy 0.7937 0.7027 0.68663
knn.h_a_test_cm$overall[1]
## Accuracy
## 0.6203288
knn.r5kf8_a_test_cm$overall[1]
## Accuracy
## 0.6913303
knn.loo_a_test_cm$overall[1]
## Accuracy
## 0.690583
For each type of classifier calculate and display the following performance related metrics in R. Use the ROCR library (library(ROCR)).
TBD
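ROCR operates on binary scores, so for the three age groups each class would be evaluated one-vs-rest. A sketch for the "old" class using the Naive Bayes posteriors (assuming nb.h_a_test$posterior carries one column of probabilities per class, as klaR's predict.NaiveBayes returns):

```r
library(ROCR)

# One-vs-rest scores and binary labels for the "old" class.
scores <- nb.h_a_test$posterior[, "old"]
labels <- as.integer(holdout_age_groups.testing$age_group == "old")

pred <- prediction(scores, labels)

# ROC curve: true positive rate against false positive rate.
perf <- performance(pred, "tpr", "fpr")
plot(perf)

# Area under the ROC curve.
performance(pred, "auc")@y.values[[1]]
```

The same three lines can be repeated for the "young" and "middle" classes to obtain a per-class ROC curve and AUC.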